Method and System for Modeling Likely Invariants in Distributed Systems

ABSTRACT

Disclosed is a method and system for modeling invariant relationships between flow intensity measurements in a distributed system. In the method, a measurement is randomly selected from a plurality of flow intensity measurements. The method searches for relationships between the randomly selected measurement and each remaining one of the plurality of flow intensity measurements, and each of the flow intensity measurements having a relationship with the randomly selected measurement is grouped into a cluster with the randomly selected measurement. The method then determines relationships between all of the flow intensity measurements in the cluster. This method is repeated with the remaining flow intensity measurements until all of the flow intensity measurements are grouped into a cluster.

This application claims the benefit of U.S. Provisional Application No. 60/820,831 filed Jul. 31, 2006, the disclosure of which is herein incorporated by reference.

BACKGROUND OF THE INVENTION

The invention relates generally to the field of fault detection and localization in complex systems. More specifically, embodiments of the invention relate to methods and systems for automatically modeling transaction flow dynamics in distributed transaction systems for fault detection and localization.

Today, numerous Internet services such as Amazon, eBay, and Google are changing traditional business models. With the abundance of Internet services, there are unprecedented needs to ensure their operational availability and reliability. Minutes of service downtime can lead to severe revenue loss and user dissatisfaction.

An information system for an Internet service is typically large, dynamic, and distributed and can comprise thousands of individual hardware and software components. A single failure in one component, whether hardware or software related, can cause an entire system to be unavailable. Studies have shown that the time taken to detect, localize, and isolate faults accounts for a large portion of the time to recover from a failure.

Transaction systems with user requests, such as Internet services and others, receive large numbers of transaction requests from users every day. These requests flow through sets of components according to specific application software logic. With such a large volume of user visits, it is unrealistic to monitor and analyze each individual user request.

Data from software log files, system audit events, network traffic statistics, etc., can be collected from system components and used for fault analysis. Since operational systems are dynamic, this data is reflective of the status of their internal states, including faults. However, given the distributed nature of information systems, evidence of fault occurrence is often scattered among the monitored data.

Advanced monitoring and management tools for system administrators to interpret monitoring data are available. IBM Tivoli, HP OpenView, and EMC InCharge suite are some of the products in the growing market of system management software. Most current tools support some form of data preprocessing and enable users to view the data with visualization functions. These tools are useful for a system administrator since it is impracticable to manually scan a large amount of monitoring data. However, these tools employ simple rule-based correlation with little embedded intelligence for reasoning.

Rule-based tools generate alerts based on violations of predetermined threshold values. Rule-based systems are therefore stateless and do not manage dynamic data analysis well. The lack of intelligence results from the difficulty in characterizing the dynamic behavior of complex systems. Characterization across complex systems is inherently system-dependent in that it is difficult to generalize across systems with different architectures and functionality.

Detection and diagnosis of faults in complex information systems is a formidable task. Current approaches for fault diagnosis use event correlation, which collects and correlates events to locate faults based on known dependencies between faults and symptoms. Due to the diversity of runtime environments, many faults experienced in an interconnected system are not very well understood. As a result, it is difficult to obtain precise fault-symptom dependencies.

One attempt at understanding relationships between system faults and symptoms was performed by the Berkeley/Stanford Recovery-Oriented Computing (ROC) group. JBoss middleware was modified to monitor traces in J2EE (Java 2 Enterprise Edition) platforms. JBoss is an open source J2EE based application server implemented in pure Java. J2EE is a programming platform for developing and running distributed multi-tier architecture applications, based largely on modular components running on an application server. Two methods were developed to collect traces for fault detection and diagnosis. However, with the huge volume of user visits, monitoring, collecting and analyzing the trace of every user request was problematic. Most methods of collecting user request traces result in a large monitoring overhead.

It is a major challenge for system administrators to detect and isolate faults effectively in large and complex systems. The challenge is how to correlate the collected data effectively across a distributed system for observation, fault detection and identification. It is therefore desirable to develop a method and system that considers the mass characteristics of user requests in complex systems and has self-cognition capability to aid in fault analysis.

BRIEF SUMMARY OF THE INVENTION

The present invention provides a method and system capable of automatically modeling transaction flow dynamics in distributed systems. The present invention models relationships between flow intensity measurements in distributed systems. Flow intensity is defined herein as the intensity with which internal monitoring data of a distributed system reacts to the volume of user requests. The present invention models relationships between flow intensities at various points in a distributed system in order to extract invariant relationships which do not change in response to external changes.

According to embodiments of the present invention, one measurement is randomly selected from a plurality of flow intensity measurements. The relationships between that measurement and each remaining one of the plurality of flow intensity measurements are searched, and each of the flow intensity measurements having a relationship with the randomly selected measurement is grouped into a cluster with the randomly selected measurement. Relationships are then determined between all of the flow intensity measurements in the cluster. These steps are repeated with the remaining flow intensity measurements until all of the flow intensity measurements are grouped into a cluster.

In one embodiment of the present invention, the relationships between all of the flow intensity measurements grouped into a cluster are determined by modeling the relationships based on the flow intensity measurements. For example, an AutoRegressive model with eXogenous inputs (ARX) can be used to model these relationships. In another embodiment of the present invention, the relationships between all of the flow intensity measurements grouped into a cluster are mathematically derived from the relationship between the selected measurement and each of the flow intensity measurements in the cluster.

These and other advantages of the invention will be apparent to those of ordinary skill in the art by reference to the following detailed description and the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an exemplary distributed transaction system;

FIG. 2 illustrates a method for extracting invariant relationships between flow intensity measurements in a distributed system;

FIG. 3 illustrates pseudo-code for a FullMesh method of extracting invariant relationships between flow intensity measurements;

FIG. 4 illustrates sequential validation of modeled relationships;

FIG. 5 illustrates an exemplary invariant based fault detection and isolation method;

FIG. 6A illustrates a method of modeling relationships between flow intensity measurements in a distributed system according to an embodiment of the present invention;

FIG. 6B illustrates pseudo code of the method shown in FIG. 6A;

FIG. 7 illustrates an example of the method shown in FIG. 6A;

FIG. 8A illustrates a method of modeling relationships between flow intensity measurements in a distributed system according to another embodiment of the present invention;

FIG. 8B is pseudo code of the method shown in FIG. 8A;

FIG. 9 illustrates an example of graph-based reasoning in fault isolation; and

FIG. 10 shows a high level block diagram of a computer capable of implementing embodiments of the present invention.

DETAILED DESCRIPTION

Embodiments of the present invention are directed to modeling invariant relationships among flow intensity measurements calculated based on monitoring data collected at various points in a distributed system. It should be noted that the invention is not limited to any particular software language described or implied in the figures. One of skill in the art will understand that a variety of alternative software languages may be used for implementation of the invention. It should also be understood that some components and items are illustrated and described as if they were hardware elements, as is common practice within the art. However, one of ordinary skill in the art, and based on a reading of this detailed description, would understand that, in at least one embodiment, components in the method and system may be implemented in software or hardware.

A distributed transaction system is a structure in which network resources, such as switching equipment and processors, are distributed throughout a geographical area being served and where processing is shared by many different parts of the network. FIG. 1 illustrates an exemplary distributed transaction system 100. Processing may be shared by client (local) computers 105, file servers, Web servers, application servers and database servers. Switching may be performed by electronic, optical, or electromechanical devices. The capability of individual computers being linked together as a network is familiar to one skilled in the art.

Most distributed transaction systems, such as Internet services, employ multi-tier architectures to integrate their components. Referring to FIG. 1, a typical three-tier architecture is shown which includes Web servers (Web tier) 110, application servers (middleware) 115, and database servers (database tier) 120. Individual computers 105 at a plurality of locations can communicate with a plurality of Web servers 110 via a communication network 125, such as the Internet. The Web servers 110 communicate with other servers, such as application servers 115 and database servers 120.

The Web server 110 acts as an interface, or gateway, to present data to a client's browser. The application server 115 supports specific business, or application, logic for various applications, which generally includes the bulk of an application. The back-end database server 120 is used for data storage.

Each tier can be built from a number of software packages running on servers (computers) which provide similar functionality. For example, such software packages may include Apache and IIS for the Web server, WebLogic and WebSphere for the application server, Oracle and DB2 for the database server, and others.

Distributed transaction systems can receive millions of user requests per day. User requests traverse a set of components and software paths according to application logic. During system operation, various system components produce large amounts of monitoring data, such as log files, to track their operational status. Much of the internal monitoring data of a system reacts to the volume of user requests. Thus, values of the monitoring data increase or decrease in accordance with the volume of user requests. Flow intensity is the intensity with which internal monitoring data of a system reacts to the volume of user requests. For example, the number of HTTP (Hypertext Transfer Protocol) request strings and SQL (Structured Query Language) queries measured over discrete sampling time periods can be used as flow intensity measurements. Measurements other than HTTP requests and SQL queries can also be used. Multiple flow intensity measurements can be acquired from one single monitoring point in the system. For example, a Web server access log can be used to derive the flow intensity measurements of all HTTP requests made through that server as well as one specific HTTP request.
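As a minimal sketch of how a flow intensity series might be derived from a log, the following counts events per sampling period. The function name `flow_intensity`, the five-second window, and the sample arrival times are hypothetical and chosen only for illustration.

```python
from collections import Counter


def flow_intensity(timestamps, window, start, n_windows):
    """Count the events that fall in each of n_windows consecutive sampling periods."""
    counts = Counter()
    for ts in timestamps:
        idx = int((ts - start) // window)  # index of the sampling period for this event
        if 0 <= idx < n_windows:
            counts[idx] += 1
    return [counts[i] for i in range(n_windows)]


# Hypothetical HTTP request arrival times, in seconds.
requests = [0.5, 1.2, 1.9, 6.0, 11.0, 12.5, 14.9]
series = flow_intensity(requests, window=5.0, start=0.0, n_windows=3)
```

Filtering the timestamps before counting (for example, keeping only one specific request string) would yield a second flow intensity measurement from the same monitoring point.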

Since flow intensity measurements at different points in a system respond similarly to the volume of user requests, there may exist a strong correlation or cointegration between many flow intensity measurements in a system. The correlation between flow intensities measured at the input and output of a component in a system could reflect constraints that the component should bear. As an engineered system, a distributed system imposes many constraints on the relationships between flow intensity measurements. Such constraints could result from a variety of factors such as hardware capacity, application software logic, system architecture and functionality. Because of such constraints, many relationships between flow intensity measurements are always constant, regardless of changes in flow intensities in response to varying user loads. Thus, if a relationship between flow intensities holds over time, the relationship can be considered an invariant relationship or an invariant of the underlying system. Invariant relationships widely exist in distributed systems, and are governed by the physical properties or logical constraints of system components.

According to embodiments of the present invention, it is possible to model flow intensity measurements in a distributed system and relationships between flow intensity measurements. For convenience, variables such as x and y are used herein to represent flow intensity measurements and equations such as y=f(x) are used to represent invariants. As described above, many flow intensity measurements change in response to changes in the volume of user requests in the system. As a time series, flow intensity measurements that correlate to each other should have similar evolving curves with the workload along time t. Therefore, many flow intensities should have linear relationships with each other. Accordingly, embodiments of the present invention use AutoRegressive models with eXogenous inputs (ARX) to learn linear relationships between flow intensity measurements.

At time t, the flow intensity measurements measured at the input and output of a component (or segment) of a distributed system can be denoted as x(t) and y(t), respectively. A segment of a distributed system is a path in a distributed system between two flow intensity measurements, and can include multiple components. The ARX model describes the following relationship between two flow intensities x(t) and y(t):

$\begin{matrix}{{y(t) + a_{1}y(t-1) + \cdots + a_{n}y(t-n)} = {b_{0}x(t-k) + \cdots + b_{m}x(t-k-m)},} & (1)\end{matrix}$

where $[n,m,k]$ is the order of the model, which determines how many previous steps are affecting the current output, and $a_{i}$ and $b_{j}$ are coefficient parameters that reflect how strongly a previous step is affecting the current output. If we denote:

$\begin{matrix}{\theta = \left\lbrack a_{1},\ldots,a_{n},b_{0},\ldots,b_{m} \right\rbrack^{T},} & (2)\end{matrix}$

$\begin{matrix}{\alpha(t) = \left\lbrack -y(t-1),\ldots,-y(t-n),x(t-k),\ldots,x(t-k-m) \right\rbrack^{T},} & (3)\end{matrix}$

then Equation (1) can be rewritten as:

$\begin{matrix}{y(t) = \alpha(t)^{T}\theta.} & (4)\end{matrix}$

Assuming that the two flow intensity measurements x(t) and y(t) are observed over a time interval $1 \leq t \leq N$, this observation can be denoted as:

$\begin{matrix}{O_{N} = \left\{ x(1),y(1),\ldots,x(N),y(N) \right\}.} & (5)\end{matrix}$

For a given θ, the observed inputs x(t) can be used to calculate simulated outputs $\hat{y}(t \mid \theta)$ according to Equation (1). Thus, an estimation error can be calculated by comparing the simulated outputs with real observed outputs. The estimation error can be defined as:

$\begin{matrix}\begin{matrix}{{E_{N}\left( \theta,O_{N} \right)} = {\frac{1}{N}\sum\limits_{t = 1}^{N}\left( y(t) - \hat{y}(t \mid \theta) \right)^{2}}} \\{= {\frac{1}{N}\sum\limits_{t = 1}^{N}\left( y(t) - \alpha(t)^{T}\theta \right)^{2}.}}\end{matrix} & (6)\end{matrix}$

Using the Least Squares Method (LSM), it is possible to find the θ that minimizes the estimation error $E_{N}(\theta,O_{N})$ as follows:

$\begin{matrix}{\theta_{N} = {\left\lbrack {\sum\limits_{t = 1}^{N}{{\alpha (t)}{\alpha (t)}^{T}}} \right\rbrack^{- 1}{\sum\limits_{t = 1}^{N}{{\alpha (t)}{{y(t)}.}}}}} & (7)\end{matrix}$
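The least-squares estimate of Equation (7) can be sketched in plain Python as follows. This is an illustrative sketch, not the patented implementation: the helper names `regressor`, `solve`, and `fit_arx` are hypothetical, the sums start at the first t for which all lagged samples exist (an assumption, since early samples have no history), and the linear system is solved by Gaussian elimination rather than by forming an explicit matrix inverse.

```python
def regressor(x, y, t, n, m, k):
    """alpha(t) of Equation (3): [-y(t-1..t-n), x(t-k..t-k-m)]."""
    return [-y[t - i] for i in range(1, n + 1)] + [x[t - k - j] for j in range(m + 1)]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if A is singular (for example, a constant input)."""
    size = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, size):
            f = M[r][col] / M[col][col]
            for c in range(col, size + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * size
    for r in range(size - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, size))
        z[r] = (M[r][size] - s) / M[r][r]
    return z


def fit_arx(x, y, n, m, k):
    """Estimate theta = [a_1..a_n, b_0..b_m] from the normal equations of Equation (7)."""
    dim = n + m + 1
    start = max(n, k + m)                   # first t with all lags available (assumption)
    S = [[0.0] * dim for _ in range(dim)]   # sum of alpha(t) alpha(t)^T
    v = [0.0] * dim                         # sum of alpha(t) y(t)
    for t in range(start, len(y)):
        a = regressor(x, y, t, n, m, k)
        for i in range(dim):
            v[i] += a[i] * y[t]
            for j in range(dim):
                S[i][j] += a[i] * a[j]
    return solve(S, v)


# Synthetic, noise-free data generated with a_1 = -0.5 and b_0 = 2, order [1, 0, 0].
x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 7.0, 3.0, 8.0]
y = [2.0 * x[0]]
for t in range(1, len(x)):
    y.append(0.5 * y[t - 1] + 2.0 * x[t])
theta = fit_arx(x, y, n=1, m=0, k=0)  # should recover approximately [-0.5, 2.0]
```

Solving the normal equations directly is a design choice made here for brevity; a production system would more likely call a numerical library's least-squares routine.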

Note that the ARX model can also be used to model the relationship between multiple inputs and multiple outputs. Accordingly, it is possible to have multiple flow intensities as the inputs and/or outputs in Equation (1). For simplicity, the correlation between two flow intensities is modeled and analyzed herein; however, the same methods can be applied with multiple flow intensities as inputs and/or outputs.

There are several possible criteria that can be used to evaluate how the learned model fits the real observation. According to an embodiment of the present invention, a normalized fitness score can be calculated for model validation using the following equation:

$\begin{matrix}{{{F(\theta)} = {\left\lbrack {1 - \sqrt{\frac{\sum\limits_{t = 1}^{N}{{{y(t)} - {\hat{y}\left( {t\text{}\theta} \right)}}}^{2}}{\sum\limits_{t = 1}^{N}{{{y(t)} - \overset{\_}{y}}}^{2}}}} \right\rbrack \cdot 100}},} & (8)\end{matrix}$

where $\bar{y}$ is the mean of the real output y(t). Equation (8) introduces a metric to evaluate how well the learned model approximates the real data. A higher fitness score indicates that the model fits the observed data better, and the upper bound for the fitness score is 100. Given the observation of two flow intensities, Equation (7) can always be used to learn a model even if this model does not reflect a meaningful relationship between the flow intensities. Therefore, only a model with a high fitness score is meaningful in characterizing relationships between flow intensity measurements. A range can be set for the order [n,m,k] rather than a fixed number, in order to learn a list of model candidates. The model with the highest fitness score can then be chosen from the list of model candidates to characterize the flow intensity relationship. As described herein, the ARX model is used to learn the long-run relationship between two flow intensity measurements, i.e., a model y=f(x) only captures the main characteristics of the relationship. The precise relationship between two flow intensity measurements can be represented as y=f(x)+ε where ε is a modeling error. Note that ε is very small for a model with a high fitness score. Accordingly, such models with high fitness scores can be considered likely invariants.
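The fitness score of Equation (8), and the selection of the best-scoring candidate from a list, might be sketched as follows. The function names, the candidate orders, and the simulated outputs are hypothetical and serve only to illustrate the calculation.

```python
import math


def fitness_score(y, y_hat):
    """Normalized fitness score of Equation (8); 100 means a perfect fit."""
    y_mean = sum(y) / len(y)
    err = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    spread = sum((a - y_mean) ** 2 for a in y)  # zero if y is constant (not handled here)
    return (1.0 - math.sqrt(err / spread)) * 100.0


def best_candidate(y, candidates):
    """Pick the (order, simulated output) pair with the highest fitness score."""
    return max(candidates, key=lambda c: fitness_score(y, c[1]))


y = [1.0, 2.0, 3.0, 4.0, 5.0]
# Hypothetical candidates: (order [n, m, k], simulated output of that model).
candidates = [
    ((1, 0, 0), [1.0, 2.0, 3.0, 4.0, 6.0]),
    ((2, 1, 0), [1.0, 2.0, 3.0, 4.0, 5.0]),
]
order, _ = best_candidate(y, candidates)
```

Note that a model that merely predicts the mean of y scores 0, and a model worse than that scores below 0, so the score has an upper bound but no lower bound.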

Given the model template shown in Equation (1), it is possible to learn a model instance between two flow intensity measurements. It is possible to collect many flow intensity measurements in a complex distributed system, but not all pairs of the flow intensity measurements have such linear relationships. In addition, due to system dynamics and uncertainties, some learned models may not be robust over a period of time. Embodiments of the present invention are directed to extracting invariant relationships between flow intensity measurements in a distributed system. FIG. 2 illustrates a method for extracting invariant relationships between flow intensity measurements in a distributed system. At step 210 relationships between flow intensity measurements are modeled, and at step 220 the relationships between the flow intensity measurements are validated to determine which of the modeled relationships are invariants.

The present inventors introduced a method for modeling and validating relationships between flow intensity measurements in U.S. patent application Ser. No. 11/275,796, which is incorporated herein by reference. This method will be referred to herein as the “FullMesh” method. FIG. 3 illustrates pseudo-code for the FullMesh method. Assume that there are n flow intensity measurements denoted by $I_{i}$, $1 \leq i \leq n$. Since there may be little or no knowledge about the relationships of the flow intensity measurements in a specific system, the FullMesh method attempts to model a relationship between each pair of the flow intensity measurements (step 210 of FIG. 2), and then validates whether the models fit new monitoring data (step 220 of FIG. 2). Accordingly, as illustrated in FIG. 3, the FullMesh method comprises two parts: invariant search 302 and invariant validation 304. Part I 302 of the FullMesh method searches for a model for any two flow intensity measurements and then Part II 304 sequentially validates these models with new monitoring data. As illustrated in FIG. 3, Part I 302 of the FullMesh method searches for a relationship between every combination of intensity measurements $I_{i}$ and $I_{j}$ in the set of flow intensity measurements using Equation (7). The fitness score $F_{i}(\theta)$ given by Equation (8) is used to evaluate how well a learned model matches the data observed during the $i^{th}$ time window. The length of this time window is denoted as L, i.e., the window includes L sampling points. A threshold $\tilde{F}$ is selected to determine whether the fitness score $F_{i}(\theta)$ is high enough for the model to fit the data. The following piecewise function can be used to determine whether a model fits the data:

$\begin{matrix}{{f\left( {F_{i}(\theta)} \right)} = \left\{ {\begin{matrix}1 & {{{if}\mspace{14mu} {F_{i}(\theta)}} > \overset{\sim}{F}} \\0 & {{{if}\mspace{14mu} {F_{i}(\theta)}} \leq \overset{\sim}{F}}\end{matrix}.} \right.} & (9)\end{matrix}$

After receiving monitoring data for k windows, i.e., a total of k·L sampling points, a confidence score $p_{k}$ can be calculated for a model θ using the following equation:

$\begin{matrix}{{p_{k}(\theta)} = {{{prob}\left( {{F_{t}(\theta)} > \overset{\sim}{F}} \right)} = {\frac{\sum\limits_{i = 1}^{k}{f\left( {F_{i}(\theta)} \right)}}{k} = {\frac{{{p_{k - 1}(\theta)} \cdot \left( {k - 1} \right)} + {f\left( {F_{k}(\theta)} \right)}}{k}.}}}} & (10)\end{matrix}$

The set of valid models at time t=k·L can be denoted as $M_{k}$, such that $M_{k} = \{\theta \mid p_{k}(\theta) > P\}$. P is the confidence threshold selected to determine whether a model is an invariant. Thus, in Part II 304 of the FullMesh method, after a time period K, if the confidence score of a model is less than the selected confidence threshold P, the model is considered invalid, and no longer has to be validated. FIG. 4 illustrates sequential validation of modeled relationships. As illustrated in FIG. 4, observation data 404 including flow intensity measurements is collected at various points in a distributed system 402. At the time $t_{0}$, a model is learned between two flow intensity measurements based on the observation data at $t_{0}$. At each time window thereafter, the learned model is tested based on the new observation data 404 collected from the distributed system 402, and if the model does not hold, it is dropped and not considered an invariant. If the model holds for k time windows, the model is considered an invariant.
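The sequential validation of Equations (9) and (10) can be illustrated with the short sketch below. It uses the recursive form of Equation (10), so that only the previous confidence score needs to be stored. The function names, the per-window fitness scores, and the threshold values (50 for the fitness threshold, 0.7 for the confidence threshold P) are hypothetical.

```python
def fits(score, fit_threshold):
    """Equation (9): 1 if the window's fitness score exceeds the threshold, else 0."""
    return 1 if score > fit_threshold else 0


def update_confidence(p_prev, k, score, fit_threshold):
    """Equation (10), recursive form: fold window k into the running confidence."""
    return (p_prev * (k - 1) + fits(score, fit_threshold)) / k


def confidence_trace(scores, fit_threshold):
    """Confidence score p_k after each of the k = 1, 2, ... windows."""
    p, trace = 0.0, []
    for k, score in enumerate(scores, start=1):
        p = update_confidence(p, k, score, fit_threshold)
        trace.append(p)
    return trace


window_scores = [90.0, 80.0, 40.0, 95.0]  # hypothetical fitness score per time window
trace = confidence_trace(window_scores, fit_threshold=50.0)
still_valid = trace[-1] > 0.7             # hypothetical confidence threshold P
```

Here the model fits three of the four windows, so its confidence after the fourth window is 3/4 and it remains in the set of valid models for a threshold of 0.7.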

The FullMesh method performs step 210 of FIG. 2 in Part I 302, and performs step 220 of FIG. 2 in Part II 304. The computational complexity of the FullMesh method is greater in Part I than in Part II. Given n flow intensity measurements, Part I needs to construct n(n−1)/2 models as the candidates of invariants. Note that given two flow intensity measurements, logically we do not know which one should be chosen as the input or the output in complex systems. Therefore, the FullMesh method, as well as the embodiments of the present invention described below, constructs two models (with reverse input and output) for each pair of flow intensity measurements, and selects the model with the highest fitness score as the invariant candidate. If the two learned models have very different fitness scores, an AutoRegressive (AR) model must have been constructed instead of an ARX model. Since we are only interested in strong correlation between two measurements, it is possible to filter out those AR models by requiring high fitness scores in both models.

The computational complexity of Part I of the FullMesh method is also higher than that of Part II because Part I requires the matrix inverse operation in Equation (7). Though Part I of the FullMesh method can run offline to search for invariants, it still may not scale well in large distributed systems. Thus, embodiments of the present invention described below are directed to methods of modeling relationships between flow intensity measurements while reducing computational complexity. The methods described in these embodiments approximate Part I of the FullMesh method and can be used to perform step 210 of FIG. 2. Part II of the FullMesh method can be used with each of the embodiments of the present invention to validate (step 220 of FIG. 2) the models constructed by the embodiments of the present invention.

According to an embodiment of the present invention, it is possible to track any changes to the invariants extracted using the method of FIG. 2 for fault detection in complex distributed systems. FIG. 5 illustrates an exemplary invariant based fault detection and isolation (FDI) method. As illustrated in FIG. 5, flow intensities x and y are measured at the input and output of a system component 502, respectively, and an invariant relationship 504 of y=f(x) is derived between the flow intensity measurements x and y. At time t, $y_{t}$ represents the actual observed output of the component 502, and $\hat{y}_{t}$ is the simulated output from the invariant 504, i.e., $\hat{y}_{t} = f(x_{t})$, where $x_{t}$ is the observed input at time t. A residual is calculated (506) as $R_{t} = |y_{t} - \hat{y}_{t}|$. When no fault occurs, it should be the case that the residual $R_{t} \leq \varepsilon_{M}$, where $\varepsilon_{M}$ is the threshold of modeling error. If a fault occurs within this component 502 at time t, it may affect this relationship and the invariant is likely to be violated. Therefore, at each time t, it is checked whether $R_{t} \leq \varepsilon_{M}$. If $R_{t}$ is greater than $\varepsilon_{M}$, then a fault has occurred in component 502 at time t. As illustrated in FIG. 5, this fault detection can be performed simultaneously at each time t for n components and invariants. In FIG. 5, a “component” 502 represents a segment of the monitored system which has input x and output y. The residuals generated from each of the n invariants can be correlated (508) in order to isolate faults in the components.
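The residual check for a single invariant can be sketched as follows. The function name `detect_faults`, the invariant y = 2x, the sample data, and the threshold of 0.5 are hypothetical; a static function is used in place of a learned ARX model to keep the illustration short.

```python
def detect_faults(x, y, f, eps_m):
    """Return the times t at which the residual |y_t - f(x_t)| exceeds eps_m."""
    return [t for t, (xt, yt) in enumerate(zip(x, y)) if abs(yt - f(xt)) > eps_m]


def invariant(xt):
    """Hypothetical invariant y = 2x between a component's input and output."""
    return 2.0 * xt


x_obs = [1.0, 2.0, 3.0, 4.0]
y_obs = [2.1, 3.9, 9.0, 8.0]   # the third sample deviates from y = 2x
violations = detect_faults(x_obs, y_obs, invariant, eps_m=0.5)
```

Running the same check for each of the n invariants at every time t gives the set of broken invariants that is correlated in step 508.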

Based on various physical meanings of flow intensity measurements, invariants could characterize target systems from many different perspectives. With various combinations of many invariants extracted from distributed systems, it is possible to detect a wide class of faults in complex systems, including operator faults, software faults, hardware faults, and networking faults. Based on the dependency relationships between invariants and their monitoring components, it is also possible to isolate faults by correlating broken invariants. In addition, the combination of invariants can also be used to capture, index, and retrieve system history. For example, a binary vector can be used to represent the status of all invariants (broken or not), in order to build profiles or signatures for various faults. Thus, when a fault is identified and isolated (508 of FIG. 5), the fault can be characterized as a known or unknown fault.
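One way such a binary status vector might be built and compared against stored fault signatures is sketched below. The fault names, signatures, residuals, and thresholds are hypothetical, and exact matching is an assumption made for simplicity; a nearest-signature match would be another reasonable choice.

```python
def status_vector(residuals, thresholds):
    """1 marks a broken invariant (residual above its threshold), 0 an intact one."""
    return tuple(1 if r > eps else 0 for r, eps in zip(residuals, thresholds))


def classify(vector, signatures):
    """Return the name of a known fault with exactly this signature, else 'unknown'."""
    for name, sig in signatures.items():
        if sig == vector:
            return name
    return "unknown"


# Hypothetical fault signatures over three invariants.
signatures = {"db_overload": (0, 1, 1), "web_crash": (1, 0, 0)}
observed = status_vector([0.1, 2.0, 3.5], [0.5, 0.5, 0.5])
label = classify(observed, signatures)
```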

Although fault detection and isolation is described herein as a possible use of invariants in distributed systems, the present invention is not limited to fault detection and isolation. For example, the inventive technique can be used in a wide range of measurement technologies to determine when measured results diverge from optimal results. In addition, invariants can also be used to support other system management tasks, such as service profiling, performance debugging, capacity planning, resource optimization, situation awareness, etc.

FIG. 6A illustrates a method of modeling relationships between flow intensity measurements in a distributed system according to an embodiment of the present invention. FIG. 6B is pseudo code of the method of FIG. 6A. One skilled in the art will readily be able to associate portions of the pseudo code of FIG. 6B with the steps of FIG. 6A. For convenience, the method of FIGS. 6A and 6B is referred to herein as the “SmartMesh” method. Rather than searching all possible n(n−1)/2 relationships, as in the FullMesh method, the SmartMesh method sequentially extracts invariants from clusters that are determined by search results from previous steps of the method. As described above, the SmartMesh method can be used to perform step 210 of the method of FIG. 2.

The set of all measurements can be denoted as $C = \{I_{i}\}$, $1 \leq i \leq n$, and an empty set denoted as φ. At step 602, one flow intensity measurement $I_{i}$ is randomly selected from the set C of flow intensity measurements as a reference measurement. This step is shown at 652 in the pseudo code of FIG. 6B.

At step 604, the SmartMesh method searches for relationships between the reference measurement $I_{i}$ and all of the other flow intensity measurements in the set C. This step is shown at 654 of the pseudo code of FIG. 6B. As illustrated at 654 of the pseudo code of FIG. 6B, a model θ is learned using Equation (7) for the relationship between the reference measurement $I_{i}$ and each of the other flow intensity measurements.

At step 606, each of the flow intensity measurements that have a relationship with the reference measurement $I_{i}$ is grouped with the reference measurement $I_{i}$ into a cluster $W_{s}$ associated with the reference measurement $I_{i}$. This step is shown at 656 of the pseudo code of FIG. 6B. As illustrated at 656 of the pseudo code of FIG. 6B, the fitness score $F_{1}(\theta)$ is calculated for each model θ learned. The fitness score $F_{1}(\theta)$ is compared to the fitness threshold $\tilde{F}$ for each model, to determine whether the reference measurement $I_{i}$ has a relationship with each of the other flow intensity measurements in C. If a flow intensity measurement is determined to have a relationship with the reference measurement $I_{i}$, the flow intensity measurement is included with the reference measurement $I_{i}$ in the cluster $W_{s}$, and the model θ of the relationship between the flow intensity measurement and the reference measurement $I_{i}$ is included in a set of invariant candidates $M_{1}$.

At step 608, relationships between each of the flow intensity measurements in the cluster $W_{s}$ are modeled. Since all flow intensity measurements in the cluster $W_{s}$ have linear relationships with the reference measurement $I_{i}$, they are likely to have linear relationships between each other. Conversely, since the remaining flow intensity measurements $C - W_{s}$ do not have linear relationships with reference measurement $I_{i}$, they are unlikely to have linear relationships with other measurements in the cluster $W_{s}$ of the reference measurement $I_{i}$. Thus, the SmartMesh method does not search any relationships between flow intensity measurements that belong to different clusters. This step is shown at 658 of the pseudo code of FIG. 6B. As illustrated at 658 of the pseudo code of FIG. 6B, the models of the relationships between each of the flow intensity measurements included in the cluster $W_{s}$ are learned using Equation (7). Thus, these models are calculated directly based on monitoring data. The fitness score of each model is calculated using Equation (8), and compared to the fitness threshold $\tilde{F}$. Each model for which the fitness score is greater than the fitness threshold $\tilde{F}$ is included in the set of invariant candidates $M_{1}$.

At step 610, it is determined whether there are any remaining flow intensity measurements that have not been grouped into a cluster. If any flow intensity measurements remain that have not been grouped into a cluster, the SmartMesh method proceeds to step 612. If there are no remaining flow intensity measurements, the SmartMesh method proceeds to step 614. At step 612, the set C_(s+1) for a next iteration is defined as the remaining flow intensity measurements C_(s)−W_(s), and steps 602-610 are repeated for the remaining flow intensity measurements C_(s+1). Thus, the method is repeated until all of the flow intensity measurements are grouped into clusters of related measurements. Note that Σ_(s)|W_(s)|=|C|=n, where |·| represents the size of a set. If one flow intensity measurement does not have any relationships (with a high enough fitness score) with any other measurements in C, this single flow intensity measurement is considered to be a cluster with size equal to one.

At step 614, when there are no remaining flow intensity measurements to be grouped into clusters, the set of models M₁ is output. This set represents a set of invariant candidates which can be validated using Part II of the FullMesh method described above.
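As an illustration only, the flow of steps 602-614 can be sketched in Python as follows. The sketch is not the pseudo code of FIG. 6B. The function name smart_mesh, the callable related (a hypothetical stand-in for learning a model and comparing its fitness score to the fitness threshold), and the returned search counter are all assumptions introduced here.

```python
import random
from itertools import combinations


def smart_mesh(measurements, related, rng=None):
    """Group measurements into clusters and collect invariant candidates.

    related(u, v) is a hypothetical callable standing in for "learn a model
    between u and v and compare its fitness score with the threshold".
    Returns (clusters, candidates, searches).
    """
    rng = rng or random.Random()
    remaining = list(measurements)           # C_1 = C
    clusters, candidates, searches = [], [], 0
    while remaining:                         # step 610: anything left to cluster?
        ref = rng.choice(remaining)          # step 602: random reference measurement
        members = []
        for m in remaining:                  # steps 604-606: search against the reference
            if m == ref:
                continue
            searches += 1
            if related(ref, m):
                members.append(m)
                candidates.append((ref, m))
        for u, v in combinations(members, 2):    # step 608: search inside the cluster
            searches += 1
            if related(u, v):
                candidates.append((u, v))
        cluster = [ref] + members
        clusters.append(cluster)
        # step 612: the next set C_{s+1} is what has not been clustered yet
        remaining = [m for m in remaining if m not in cluster]
    return clusters, candidates, searches    # step 614: output the candidates
```

In this sketch a measurement that is related to nothing else simply ends up as a cluster of size one, and no pair from two different clusters is ever passed to related after the reference search.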

FIG. 7 illustrates an example of how the SmartMesh method works. As illustrated in FIG. 7, a set of flow intensity measurements includes five flow intensity measurements C={a,b,c,d,e}. In a first iteration of the SmartMesh method, measurement a is randomly selected as the reference measurement. After modeling relationships with each of the other four measurements (4 searches), it is determined that a has relationships with b and c. Thus, a, b, and c are grouped together in the cluster {a,b,c}. Since both b and c have linear relationships with a, they are likely to have a linear relationship with each other, so a model of the relationship between b and c is calculated. Conversely, {d,e} do not have any linear relationships with a, so they are unlikely to have linear relationships with {b,c}. Therefore, there is no search for any relationships between {d,e} and {b,c}. In a second iteration of the SmartMesh method, measurement e is randomly selected as the reference measurement from the set C₂={d,e}. After another search, it is determined that e has a relationship with d, so d and e are grouped together in the cluster {d,e}, and C₃ is empty after the first two iterations.

For further understanding of the SmartMesh method, consider a triangle relationship among three measurements {a,b,c}. In one case, assume that a has linear relationships with b and c, i.e., a=f(b) and c=g(a), where f(·) and g(·) are linear functions as shown in Equation (1). Therefore, we can conclude that b and c have the linear relationship c=g(f(b)). As described above, in each relationship search, two models are constructed with reverse input and output between two flow intensity measurements. If both functions f(·) and g(·) are linear, then the function g(f(·)) is also linear. However, since modeling errors exist in both a=f(b) and c=g(a), the derived equation c=g(f(b)) may not precisely characterize the relationship between b and c. Therefore, as illustrated in FIGS. 6A and 6B, the SmartMesh method models their relationship directly with observed data.

In another case, assume that a has a relationship with b but not c. In this case, it is unlikely that b and c will have a strong relationship. Otherwise, as analyzed in the first case, theoretically we could conclude a linear relationship between a and c, which contradicts the assumption in this case. However, in practice, there may exist modeling errors in these equations. A threshold F̃ can be selected, and Equation (9) can be used to determine whether two measurements have a relationship or not. For the second case, if the correlation between a and c has a fitness score slightly below the threshold, the correlation between b and c could still have a fitness score slightly higher than the threshold. Therefore, due to modeling errors ε in Equation (1), in practice b and c could, in rare occurrences, still have a relationship in the second case. Since the SmartMesh method does not search for relationships between measurements that belong to different clusters, it may not discover as many invariants as the FullMesh method does. Accordingly, the SmartMesh method approximates the results of the FullMesh method.

With minimal loss of accuracy, the SmartMesh method significantly reduces the computational complexity of an invariant search. The computational complexity of Part I of the FullMesh method can be denoted as T₁, and includes n(n−1)/2 searches, i.e., T₁=n(n−1)/2. The computational complexity of the SmartMesh method can be denoted as T₂. Given n measurements, assume that the set of measurements C can be split into K clusters W_(s) (1≤s≤K), as illustrated in FIGS. 6A and 6B. For convenience, the size of cluster W_(s) can be denoted as m_(s), such that m_(s)=|W_(s)|, and

${\sum\limits_{s = 1}^{K}m_{s}} = {{|C|} = {n.}}$

For the first iteration of the SmartMesh method, the randomly selected reference measurement needs n−1 searches to discover its cluster W₁, and then (m₁−1)(m₁−2)/2 searches are needed to discover invariants within this cluster W₁. Therefore, the computational complexity of the first iteration can be calculated with the following equation:

$\begin{matrix}\begin{matrix}{{T_{2}(1)} = {n - 1 + \frac{\left( {m_{1} - 1} \right)\left( {m_{1} - 2} \right)}{2}}} \\{= {n - m_{1} + \frac{m_{1}\left( {m_{1} - 1} \right)}{2}}}\end{matrix} & (11)\end{matrix}$

For the second iteration of the SmartMesh method, the second reference measurement needs n−m₁−1 searches to discover its cluster W₂, and then (m₂−1)(m₂−2)/2 searches are needed to discover invariants within this cluster W₂. Therefore, the computational complexity of the second iteration, T₂(2), can be obtained by replacing m₁ with m₂ and replacing n with n−m₁ in Equation (11). Thus, for all K iterations, the computational complexity of the SmartMesh method can be calculated with the following equation:

$\begin{matrix}{\begin{matrix}{T_{2} = {\sum\limits_{s = 1}^{K}{T_{2}(s)}}} \\{= {{\sum\limits_{s = 1}^{K}\frac{m_{s}\left( {m_{s} - 1} \right)}{2}} + {\sum\limits_{s = 1}^{K}\left( {n - {\sum\limits_{j = 1}^{s}m_{j}}} \right)}}} \\{{= {{\sum\limits_{s = 1}^{K}\frac{m_{s}\left( {m_{s} - 1} \right)}{2}} + {\sum\limits_{s = 1}^{K}{\left( {s - 1} \right)m_{s}}}}},}\end{matrix}{{{\text{where}}\mspace{14mu} n} = {\sum\limits_{j = 1}^{K}{m_{j}.}}}} & (12)\end{matrix}$
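The search counts above can be checked numerically. The following Python sketch, offered only as an illustration, evaluates T₁ and evaluates T₂ in two ways: from the closed form on the last line of Equation (12), and by summing the per-iteration cost of Equation (11). The function names are assumptions introduced here, and sizes is assumed to list the cluster sizes in the order in which the clusters are discovered.

```python
def full_mesh_searches(n):
    """T1 = n(n-1)/2 searches for Part I of the FullMesh method."""
    return n * (n - 1) // 2


def smart_mesh_searches(sizes):
    """T2 from the closed form on the last line of Equation (12)."""
    pair_term = sum(m * (m - 1) // 2 for m in sizes)
    # enumerate() is 0-based, so its index equals (s - 1) for the 1-based s.
    order_term = sum(index * m for index, m in enumerate(sizes))
    return pair_term + order_term


def smart_mesh_searches_by_iteration(sizes):
    """T2 as the sum of per-iteration costs, following Equation (11)."""
    n, used, total = sum(sizes), 0, 0
    for m in sizes:
        # (n - used - 1) reference searches, then searches inside the cluster
        total += (n - used - 1) + (m - 1) * (m - 2) // 2
        used += m
    return total
```

Both T₂ functions should agree for any order of cluster sizes, which is one way to confirm that the inner sum in Equation (12) runs up to s.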

The SmartMesh method is a randomized method because the sequential order of clusters W_(s) is randomly determined by the selection of reference measurements, and the sequential order of cluster sizes could affect the complexity T₂ of the method significantly. Given a sequential order of cluster sizes (m₁,m₂, . . . ,m_(K)), Equation (12) can be used to calculate its computational complexity T₂(m₁,m₂, . . . ,m_(K)). For K clusters, there are K! possible sequential orders. For convenience, o_(i) (1≤i≤K!) can be used to represent a sequential order of cluster sizes, i.e., o_(i)=Permutation({m_(s)}_(1≤s≤K)). Note that the sequential order of cluster sizes o_(i) only affects the second term on the right side of Equation (12) but not the first term. Given n measurements that can be split into K clusters, if the cluster sizes monotonically decrease at each step, i.e., m₁≥m₂≥ . . . ≥m_(K), this results in the smallest T₂. Conversely, if the cluster sizes monotonically increase at each step, we have the largest T₂.

Since the reference measurement is randomly selected at each step and larger clusters include more measurements, the probability of choosing a reference measurement from a large cluster is higher than from a small cluster. Therefore, in the average case, T₂ should be biased toward the smallest T₂. Given a sequential order of cluster sizes (m₁,m₂, . . . ,m_(K)), we can calculate the probability of the sequential order of clusters with the following equation:

$\begin{matrix}\begin{matrix}{{P\left( {m_{1},\ldots \mspace{11mu},m_{K}} \right)} = {{P\left( m_{1} \right)}{P\left( m_{2} \right)}\left( m_{1} \right)\mspace{11mu} \ldots \mspace{11mu} {P\left( {{m_{K}\text{}m_{1}},\ldots \mspace{11mu},m_{K - 1}} \right)}}} \\{= {\frac{m_{1}}{n}\frac{m_{2}}{n - m_{1}}\mspace{11mu} \ldots \mspace{11mu} \frac{m_{K}}{m_{K}}}} \\{= {\prod\limits_{j = 1}^{K}\frac{m_{j}}{n - {\sum\limits_{i = 1}^{j - 1}m_{i}}}}}\end{matrix} & (13)\end{matrix}$

Based on Equations (12) and (13), the computational complexity of the average case can be calculated with the following equation:

$\begin{matrix}{{{E\left( T_{2} \right)} = {\sum\limits_{i = 1}^{K!}{{T_{2}\left( o_{i} \right)}{P\left( o_{i} \right)}}}},} & (14)\end{matrix}$

where T₂(o_(i)) and P(o_(i)) are given in Equation (12) and Equation (13), respectively.
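For small K, the average case of Equation (14) can be evaluated by direct enumeration. The Python sketch below is an illustration under stated assumptions: the function names are introduced here, exact fractions are used to avoid rounding, and every one of the K! orders is enumerated, which is practical only for a handful of clusters.

```python
from fractions import Fraction
from itertools import permutations


def t2(sizes):
    """Equation (12): searches for one discovery order of cluster sizes."""
    return (sum(m * (m - 1) // 2 for m in sizes)
            + sum(index * m for index, m in enumerate(sizes)))


def order_probability(sizes):
    """Equation (13): probability of discovering clusters in this order."""
    n, used, p = sum(sizes), 0, Fraction(1)
    for m in sizes:
        p *= Fraction(m, n - used)   # reference drawn from the unclustered rest
        used += m
    return p


def expected_t2(sizes):
    """Equation (14): average of T2 over all K! discovery orders."""
    return sum(t2(order) * order_probability(order)
               for order in permutations(sizes))
```

For cluster sizes 3 and 2, the order (3,2) has probability 3/5 and costs 6 searches, while the order (2,3) has probability 2/5 and costs 7 searches, so the expected cost is 32/5, closer to the best case than to the worst case.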

In all cases of the SmartMesh method, T₂≤T₁. Given a sequential order of cluster sizes (m₁,m₂, . . . ,m_(K)), the reduction of computational complexity can be calculated with the following equation:

$\begin{matrix}\begin{matrix}{{T_{1} - T_{2}} = {\frac{n\left( {n - 1} \right)}{2} - {\sum\limits_{s = 1}^{K}\frac{m_{s}\left( {m_{s} - 1} \right)}{2}} - {\sum\limits_{s = 1}^{K}\left( {n - {\sum\limits_{j = 1}^{s}m_{j}}} \right)}}} \\{= {\frac{n\left( {n - 1} \right)}{2} - {\sum\limits_{s = 1}^{K}\frac{m_{s}^{2}}{2}} + {\sum\limits_{s = 1}^{K}\frac{m_{s}}{2}} - {\sum\limits_{s = 1}^{K}\left( {n - {\sum\limits_{j = 1}^{s}m_{j}}} \right)}}} \\{= {\frac{n^{2}}{2} - {\sum\limits_{s = 1}^{K}\frac{m_{s}^{2}}{2}} - {\sum\limits_{s = 1}^{K}\left( {n - {\sum\limits_{j = 1}^{s}m_{j}}} \right)}}} \\{= {\frac{\left( {\sum\limits_{s = 1}^{K}m_{s}} \right)^{2} - {\sum\limits_{s = 1}^{K}m_{s}^{2}}}{2} - {\sum\limits_{s = 1}^{K}\left( {n - {\sum\limits_{j = 1}^{s}m_{j}}} \right)}}} \\{= {{\sum\limits_{s = 1}^{K}\left\lbrack {m_{s}\left( {n - {\sum\limits_{j = 1}^{s}m_{j}}} \right)} \right\rbrack} - {\sum\limits_{s = 1}^{K}\left( {n - {\sum\limits_{j = 1}^{s}m_{j}}} \right)}}} \\{= {\sum\limits_{s = 1}^{K}{\left( {m_{s} - 1} \right){\left( {n - {\sum\limits_{j = 1}^{s}m_{j}}} \right).}}}}\end{matrix} & (15)\end{matrix}$

Since m_(s)≧1 and s≦K,

$n = {{\sum\limits_{j = 1}^{K}m_{j}} \geq {\sum\limits_{j = 1}^{s}m_{j}}}$

in all cases. Therefore, we always have T₁−T₂≥0. The value of T₁−T₂ is dependent on the sequential order of cluster sizes. For example, consider two special cases. In the first case, the set of n measurements is split into K=n clusters with the size of each cluster m_(s)=1 (i.e., there is no relationship between any measurements). According to Equation (15), T₁=T₂ for the first case. In the second case, the set of n measurements is split into K=1 cluster with the size of this cluster m₁=n (i.e., there is a relationship between any pair of measurements). In the second case, T₁=T₂ as well. According to Equation (15), for all other cases where 1<m₁<n, T₁>T₂. Thus, the SmartMesh method significantly reduces the computational complexity for invariant searching, as compared to the FullMesh method.
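As a worked check of Equations (12) and (15), consider the five measurements of FIG. 7, assuming the clusters are discovered in the order {a,b,c}, {d,e}, so that n=5, K=2, m₁=3, and m₂=2:

```latex
\begin{align*}
T_1 &= \frac{5 \cdot 4}{2} = 10,\\
T_2 &= \left(\frac{3 \cdot 2}{2} + \frac{2 \cdot 1}{2}\right) + \left(0 \cdot 3 + 1 \cdot 2\right) = 6,\\
T_1 - T_2 &= (3-1)(5-3) + (2-1)(5-5) = 4.
\end{align*}
```

The value T₂=6 matches the searches counted in the FIG. 7 example: four for reference a, one between b and c, and one for reference e. If the smaller cluster were discovered first (m₁=2, m₂=3), the same formulas would give T₂=7 and T₁−T₂=3.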

FIG. 8A illustrates a method of modeling relationships between flow intensity measurements in a distributed system according to another embodiment of the present invention. FIG. 8B is pseudo code of the method of FIG. 8A. One skilled in the art will readily be able to associate portions of the pseudo code of FIG. 8B with the steps of FIG. 8A. For convenience, the method of FIGS. 8A and 8B is referred to herein as the “SimpleTree” method.

As illustrated in FIG. 8A, steps 802, 804, 806, 810, 812, and 814 are substantially the same as steps 602, 604, 606, 610, 612, and 614 of FIG. 6A, and the descriptions of these steps are not repeated. At step 808, the SimpleTree method derives a mathematical model between each of the flow intensity measurements in a cluster W_(s) based on the relationship between the reference measurement I_(i) and each of the flow intensity measurements in the cluster W_(s). For example, consider a triangle relationship among three measurements {x,y,z}, each of which is a time series. Assume that models have already been built between x and y and between x and z. In the SmartMesh method of FIG. 6A, monitoring data is used to learn a model between y and z. However, the SimpleTree method directly derives a mathematical relationship between y and z without using any monitoring data.

Introducing the backward shift operator q⁻¹ such that q⁻¹y(t)=y(t−1), the ARX model in Equation (1) can be rewritten as:

$\begin{matrix}{{y(t)} = {{x(t)}{\frac{b_{0} + {b_{1}q^{- 1}} + \ldots + {b_{m}q^{- m}}}{1 + {a_{1}q^{- 1}} + \ldots + {a_{n}q^{- n}}}.}}} & (16)\end{matrix}$

Without loss of generality, we assume that k=0 in Equation (1). If k>0, we can set the first k coefficients b_(i)=0 (0≤i≤k−1). Let us denote a₀=1, A_(n)=(a₀,a₁, . . . ,a_(n)), B_(m)=(b₀,b₁, . . . ,b_(m)), and Q_(k)=(q⁰,q⁻¹, . . . ,q^(−k)). Note that A_(n) and B_(m) are coefficient vectors. According to the definition of θ in Equation (2), we should have (1,θ^(T))=(A_(n),B_(m)). Therefore, Equation (16) can be rewritten as:

$\begin{matrix}{{{y(t)} = {{x(t)}\frac{B_{m}Q_{m}^{T}}{A_{n}Q_{n}^{T}}}},} & (17)\end{matrix}$

where Q^(T) is the matrix transposition of Q. Similarly, assume that x and z have the following relationship:

$\begin{matrix}{{{x(t)} = {{z(t)}\frac{D_{\hat{m}}Q_{\hat{m}}^{T}}{C_{\hat{n}}Q_{\hat{n}}^{T}}}},} & (18)\end{matrix}$

where D_(m̂) and C_(n̂) are coefficient vectors. As described above, in each invariant search two models are constructed with reverse input and output between two flow intensity measurements. Thus, based on Equation (17) and Equation (18), the following model between y and z can be derived:

$\begin{matrix}{{{y(t)} = {{z(t)}\frac{Q_{m}B_{m}^{T}D_{\hat{m}}Q_{\hat{m}}^{T}}{Q_{n}A_{n}^{T}C_{\hat{n}}Q_{\hat{n}}^{T}}}},} & (19)\end{matrix}$

where B_(m)^(T)D_(m̂) is an (m+1)×(m̂+1) matrix and A_(n)^(T)C_(n̂) is an (n+1)×(n̂+1) matrix. For convenience, denote V=B_(m)^(T)D_(m̂) and U=A_(n)^(T)C_(n̂).

According to the rules of polynomial multiplication, Equation (19) can be rewritten with the following equation:

$\begin{matrix}{{{y(t)} = {{z(t)}\frac{F_{m + \hat{m}}Q_{m + \hat{m}}^{T}}{E_{n + \hat{n}}Q_{n + \hat{n}}^{T}}}},} & (20)\end{matrix}$

where F_(m+m̂) and E_(n+n̂) are (m+m̂+1)-dimensional and (n+n̂+1)-dimensional coefficient vectors, respectively. Denote the j^(th) (0≤j≤m+m̂) element of F_(m+m̂) as F_(m+m̂)^(j) and the i^(th) (0≤i≤n+n̂) element of E_(n+n̂) as E_(n+n̂)^(i). Based on Equation (19) and Equation (20), the following equations are derived:

$\begin{matrix}{{V = {B_{m}^{T}D_{\hat{m}}}},} & (21) \\{{U = {A_{n}^{T}C_{\hat{n}}}},} & (22) \\{{F_{m + \hat{m}}^{j} = {\sum\limits_{i = 0}^{j}V_{i,{j - i}}}},} & (23) \\{E_{n + \hat{n}}^{i} = {\sum\limits_{k = 0}^{i}{U_{k,{i - k}}.}}} & (24)\end{matrix}$

Note that for any i>m or j>m̂, V_(i,j)=0, and for any i>n or j>n̂, U_(i,j)=0.

In the triangle relationship among flow intensity measurements {x,y,z}, given two models with parameters (1,θ_(yx)^(T))=(A_(n),B_(m)) and (1,θ_(xz)^(T))=(C_(n̂),D_(m̂)), the matrices V and U can be calculated using Equations (21) and (22). A model θ_(yz) can then be derived with (1,θ_(yz)^(T))=(E_(n+n̂),F_(m+m̂)), where F_(m+m̂) and E_(n+n̂) are computed using Equations (23) and (24).
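The derivation of Equations (21)-(24) amounts to multiplying two pairs of polynomials in q⁻¹. The following Python sketch is one possible illustration, not the pseudo code of FIG. 8B. The function name derive_model and the list-based representation are assumptions introduced here, with each coefficient vector given as a plain list whose index is the power of q⁻¹ and with the sums of Equations (23) and (24) taken from index 0.

```python
def derive_model(A, B, C, D):
    """Derive the y-from-z coefficients (E, F) of Equation (20).

    A, B: denominator and numerator coefficients of the y-from-x model,
          as in Equation (17), with A[0] = a_0 = 1.
    C, D: denominator and numerator coefficients of the x-from-z model,
          as in Equation (18), with C[0] = 1.
    """
    def antidiagonal_sums(left, right):
        # Outer product, as in Equations (21) and (22).
        outer = [[p * q for q in right] for p in left]
        # Element j of the result adds every entry whose two indices sum
        # to j, as in Equations (23) and (24).
        result = [0.0] * (len(left) + len(right) - 1)
        for i, row in enumerate(outer):
            for k, value in enumerate(row):
                result[i + k] += value
        return result

    F = antidiagonal_sums(B, D)   # numerator, built from V = B^T D
    E = antidiagonal_sums(A, C)   # denominator, built from U = A^T C
    return E, F
```

For example, with A=[1, -0.5], B=[2], C=[1, 0.25], and D=[3, 1], the sketch returns E=[1.0, -0.25, -0.125] and F=[6.0, 2.0], so the derived model has order [2,1] even though each input model has order at most [1,1].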

Step 808 of FIG. 8A is shown at 850 of the pseudo code of FIG. 8B. As illustrated at 850 of the pseudo code of FIG. 8B, the models between each of the flow intensity measurements in the cluster W_(s) are derived from the relationships between the flow intensity measurements in the cluster W_(s) and the reference measurement I_(i) using Equations (21)-(24). Thus, after randomly selecting a reference measurement I_(i) (step 802) and defining its cluster W_(s) (steps 804 and 806), the SimpleTree method does not use Equation (7) to learn models for measurements in the same cluster W_(s). Instead, it mathematically derives these models from the learned models that correlate the reference measurement I_(i) with the other measurements in the cluster W_(s). Note that any two measurements in the same cluster form a triangle relationship with the reference measurement. Therefore, the SimpleTree method only needs to search a two-level tree structure for each cluster, whose root is the reference measurement.

As described above, a derived model may not be as accurate as a model directly learned from observed monitoring data. The modeling errors of the two models used to infer the third model can compound and lead to a large error in the third model. In addition, the order of a derived model may be larger than that of a learned model. For example, while θ_(yx) is an n+m dimensional vector and θ_(xz) is an n̂+m̂ dimensional vector, the derived model θ_(yz) is an n+m+n̂+m̂ dimensional vector. However, in distributed information systems, the latency of user requests can be very short so that the orders of ARX models (such as [n,m] and [n̂,m̂]) are typically very small. Therefore, the order of a derived model (such as [n+n̂,m+m̂]) will remain relatively small even though it is larger than that of the other two models. Compared to the SmartMesh method, the SimpleTree method may lose more invariants because some derived models may not be accurate enough to pass the sequential testing process (i.e., Part II of the FullMesh method).

The computational complexity of the SimpleTree method can be denoted as T₃. Since the model orders such as [n,m] and [n̂,m̂] are very small, the computational complexity of Equations (21)-(24) can be so small as to be ignored in the analysis of T₃. By following a similar analysis to that of T₂ described above, the following equation can be used to calculate T₃:

$\begin{matrix}\begin{matrix}{T_{3} = {{\sum\limits_{s = 1}^{K}{T_{3}(s)}} = {n - 1 + n - m_{1} - 1 + \ldots + n - {\sum\limits_{s = 1}^{K - 1}m_{s}} - 1}}} \\{= {{{K\left( {n - 1} \right)} - {\sum\limits_{s = 1}^{K}{\left( {K - s} \right)m_{s}}}} = {{\sum\limits_{s = 1}^{K}{s\, m_{s}}} - K}}}\end{matrix} & (25)\end{matrix}$

T₃ is also dependent on the sequential order of cluster sizes {m_(s)}_(1≤s≤K). Given n measurements that can be split into K clusters, if the cluster sizes monotonically decrease at each step, i.e., m₁≥m₂≥ . . . ≥m_(K), we have the smallest possible T₃. Conversely, if the cluster sizes monotonically increase at each step, we have the largest T₃ (the worst case). For the average case, the expected value of T₃ is given by:

$\begin{matrix}{{{E\left( T_{3} \right)} = {\sum\limits_{i = 1}^{K!}{{T_{3}\left( o_{i} \right)}{P\left( o_{i} \right)}}}},} & (26)\end{matrix}$

where T₃(o_(i)) and P(o_(i)) are given by Equation (25) and Equation (13), respectively.

In practice, for a specific problem, it may not be known how K (the number of clusters) increases with the growth of n (the number of measurements). If K is constant, the computational complexity T₃ of the SimpleTree method would be O(n); if K increases as O(lg(n)), T₃ would be O(n lg(n)). Therefore, T₃ depends not only on the size n of the input data but also on the number of clusters K included in this data. Compared to T₂, the reduction of computational complexity in T₃ can be derived with the following equation:

$\begin{matrix}\begin{matrix}{{T_{2} - T_{3}} = {\sum\limits_{s = 1}^{K}\left( {{T_{2}(s)} - {T_{3}(s)}} \right)}} \\{= {\sum\limits_{s = 1}^{K}\frac{\left( {m_{s} - 1} \right)\left( {m_{s} - 2} \right)}{2}}}\end{matrix} & (27)\end{matrix}$

Since m_(s)≥1 (1≤s≤K), T₂≥T₃. If a set of n measurements is split into K=n clusters with the size of each cluster m_(s)=1, then, according to Equation (27), T₂=T₃. However, if the set of n measurements is split into K=1 cluster with the size of this cluster m₁=n, then T₂=n(n−1)/2 and T₃=n−1. In this case, we reduce the computational complexity of invariant search from O(n²) to O(n). In general, according to Equation (27), the SimpleTree method significantly reduces the computational complexity, as compared to the SmartMesh method, although the SimpleTree method may be less accurate than the SmartMesh method.
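The relationship between T₂ and T₃ can be checked with a short Python sketch, given here only as an illustration. The function names are assumptions introduced here, and sizes is again assumed to list the cluster sizes in discovery order.

```python
def t2(sizes):
    """Equation (12): SmartMesh searches for one order of cluster sizes."""
    return (sum(m * (m - 1) // 2 for m in sizes)
            + sum(index * m for index, m in enumerate(sizes)))


def t3(sizes):
    """Closed form of Equation (25): SimpleTree searches."""
    return sum(s * m for s, m in enumerate(sizes, start=1)) - len(sizes)


def t3_by_iteration(sizes):
    """Equation (25) as a sum: only the reference searches are counted."""
    n, used, total = sum(sizes), 0, 0
    for m in sizes:
        total += n - used - 1   # searches from the reference to the rest
        used += m
    return total


def reduction(sizes):
    """Equation (27): searches saved relative to the SmartMesh method."""
    return sum((m - 1) * (m - 2) // 2 for m in sizes)
```

With a single cluster of six measurements, t2([6]) is 15 and t3([6]) is 5, which is the O(n²) to O(n) reduction described above. With only singleton clusters, both functions return the same value and the reduction is zero.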

As discussed above, any two measurements in the same cluster W_(s) form a triangle relationship with their reference measurement, and in this relationship we use two learned models to derive the third one. For some system management tasks, the third model may only provide “redundant” information because it can be derived from the other two models directly. In this case, we may only need to search the tree structure of invariants without any model derivation in the SimpleTree method. However, for some system management tasks such as fault isolation, these derived models could be used to isolate faults more precisely. For example, in a triangle relationship, if we observe that one of the two learned models is violated due to fault occurrences, we cannot determine which measurement among the two measurements associated with the broken model is affected. Instead, if we have three models and two models are violated, it is possible to determine which measurement is most likely affected. An example is given in FIG. 9 to illustrate such graph-based reasoning in fault isolation. FIG. 9 illustrates two sets of models 902 and 904. In the case of 902, models relate flow intensity measurements a and b, and a and c, and the model between a and c is broken. When a fault is detected in the model between a and c, it is possible that a or c is the flow intensity measurement affected by the fault. In the case of 904, models relate flow intensity measurements a and b, a and c, and b and c, and the models between a and c and between b and c are broken. Thus, c, the only measurement shared by both broken models, is most likely the flow intensity measurement affected by the fault.
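The graph-based reasoning of FIG. 9 can be approximated by a simple counting heuristic, sketched below in Python. This heuristic is an assumption made for illustration and is not stated in the description: each broken model is treated as one vote against both of its measurements, and the measurements with the most votes are reported. The function name most_likely_affected is hypothetical.

```python
from collections import Counter


def most_likely_affected(broken_models):
    """Return the measurements that appear in the most broken models.

    broken_models is an iterable of (measurement, measurement) pairs, one
    pair per violated model. Ties are all returned, which reflects the case
    where the affected measurement cannot be singled out.
    """
    votes = Counter(m for pair in broken_models for m in pair)
    if not votes:
        return set()
    top = max(votes.values())
    return {m for m, count in votes.items() if count == top}
```

Applied to case 902, most_likely_affected([("a", "c")]) returns both a and c, so the fault cannot be localized further. Applied to case 904, most_likely_affected([("a", "c"), ("b", "c")]) returns only c.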

The steps of the methods described herein may be performed by computers containing processors which are executing computer program code which defines the functionality described herein. Such computers are well known in the art, and may be implemented, for example, using well known computer processors, memory units, storage devices, computer software, and other components. A high level block diagram of such a computer is shown in FIG. 10. Computer 1002 contains a processor 1004 which controls the overall operation of computer 1002 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1012 (e.g., magnetic disk) and loaded into memory 1010 when execution of the computer program instructions is desired. Thus, the operation of computer 1002 is defined by computer program instructions stored in memory 1010 and/or storage 1012 and the computer 1002 will be controlled by processor 1004 executing the computer program instructions. Computer 1002 also includes one or more network interfaces 1006 for communicating with other devices via a network. Computer 1002 also includes input/output 1008 which represents devices which allow for user interaction with the computer 1002 (e.g., display, keyboard, mouse, speakers, buttons, etc.). One skilled in the art will recognize that an implementation of an actual computer will contain other components as well, and that FIG. 10 is a high level representation of some of the components of such a computer for illustrative purposes. One skilled in the art will also recognize that the functionality described herein may be implemented using hardware, software, and various combinations of hardware and software.

The foregoing Detailed Description is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention.

1. A method for modeling invariant relationships among a plurality of flow intensity measurements in a distributed system, comprising: (a) randomly selecting one measurement from the plurality of flow intensity measurements; (b) searching for a relationship between said one measurement and each remaining one of the plurality of flow intensity measurements; (c) grouping said one measurement and each one of the plurality of flow intensity measurements having a relationship with said one measurement into a cluster; and (d) determining a relationship between a first flow intensity measurement in the cluster and a second flow intensity measurement in the cluster, both the first and the second measurements being other than said one measurement.
2. The method of claim 1, further comprising: (e) iteratively repeating steps (a)-(d) for remaining ones of the plurality of flow intensity measurements not grouped into a cluster until each of the remaining ones of the plurality of flow intensity measurements is grouped into a cluster.
3. The method of claim 2, wherein step (d) comprises: determining relationships between all of the flow intensity measurements in the cluster.
4. The method of claim 1, wherein step (b) comprises: calculating a model of a relationship between said one measurement and each remaining one of the plurality of flow intensity measurements; calculating a fitness score for each model; and comparing the fitness score for each model to a threshold to determine which of the remaining ones of the plurality of flow intensity measurements have a relationship with said one measurement.
5. The method of claim 4, wherein said step of calculating a model of a relationship between said one measurement and each remaining one of the plurality of flow intensity measurements comprises: modeling the relationship between said one measurement and each remaining one of the plurality of flow intensity measurements using an AutoRegressive model with eXogenous inputs (ARX).
6. The method of claim 1, wherein step (d) comprises: calculating a model of a relationship between each of the flow intensity measurements grouped into the cluster based on observed monitoring data.
7. The method of claim 6, wherein said step of calculating a model of a relationship between each of the flow intensity measurements grouped into the cluster comprises: modeling the relationship between each of the flow intensity measurements grouped into the cluster using an AutoRegressive model with eXogenous inputs (ARX).
8. The method of claim 6, wherein step (d) further comprises: calculating a fitness score for each model; and comparing the fitness score for each model with a threshold to determine whether each model accurately represents a relationship between flow intensity measurements.
9. The method of claim 1, wherein step (d) comprises: deriving a relationship between each of the flow intensity measurements grouped into a cluster based on the relationships between said one measurement and each of the flow intensity measurements grouped into a cluster.
10. The method of claim 9, wherein step (d) further comprises: calculating a fitness score for each derived relationship between each of the flow intensity measurements grouped into the cluster; and comparing the fitness score for each derived relationship with a threshold to determine whether each derived relationship accurately represents a relationship between flow intensity measurements.
11. The method of claim 1, further comprising: sequentially validating relationships between flow intensity measurements determined in steps (b) and (d) with observed flow intensity measurements over a period of time to determine whether the relationships are invariant.
12. The method of claim 11, further comprising: tracking changes in invariant relationships for fault detection and isolation.
13. A method for fault detection in a distributed system comprising: (a) randomly selecting one measurement from the plurality of flow intensity measurements; (b) searching for a relationship between said one measurement and each remaining one of the plurality of flow intensity measurements; (c) grouping said one measurement and each one of the plurality of flow intensity measurements having a relationship with said one measurement into a cluster; and (d) determining relationships between each of the flow intensity measurements other than said one measurement grouped into the cluster; (e) iteratively repeating steps (a)-(d) for remaining ones of the plurality of flow intensity measurements not grouped into a cluster until each of the remaining ones of the plurality of flow intensity measurements is grouped into a cluster; (f) validating relationships determined between ones of the plurality of flow intensity measurements to determine whether the relationships are invariant relationships; (g) predicting flow intensities based on the relationships determined to be invariant relationships; and (h) detecting faults in the distributed system by comparing measured flow intensity measurements to the predicted flow intensities.
 14. A computerreadable medium storing computer program instructions for performing amethod for modeling invariant relationships from a plurality of flowintensity measurements in a distributed system, the computer programinstructions defining the steps comprising: (a) randomly selecting onemeasurement from the plurality of flow intensity measurements; (b)searching for a relationship between said one measurement and eachremaining one of the plurality of flow intensity measurements; (c)grouping said one measurement and each one of the plurality of flowintensity measurements having a relationship with said one measurementinto a cluster; and (d) determining a relationship between a first flowintensity measurement in the cluster and a second flow intensitymeasurement in the cluster, both the first and the second measurementsbeing other than said one measurement.
 15. The computer readable mediumof claim 14, further comprising computer program instructions definingthe step of: (e) iteratively repeating steps (a)-(d) for remaining onesof the plurality of flow intensity measurements not grouped into acluster until each of the remaining ones of the plurality of flowintensity measurements is grouped into a cluster.
 16. The computerreadable medium of claim 14, wherein the computer program instructionsdefining step (d) comprise computer program instructions defining thestep of: determining relationships between all of the flow intensitymeasurements in the cluster.
 17. The computer readable medium of claim14, wherein the computer program instructions defining step (b) comprisecomputer program instructions defining the steps of: calculating a modelof a relationship between said one measurement and each remaining one ofthe plurality of flow intensity measurements; calculating a fitnessscore for each model; and comparing the fitness score for each model toa threshold to determine which of the remaining ones of the plurality offlow intensity measurements have a relationship with said onemeasurement.
 18. The computer readable medium of claim 17, wherein thecomputer program instructions defining the step of calculating a modelof a relationship between said one measurement and each remaining one ofthe plurality of flow intensity measurements comprise computer programinstructions defining the step of: modeling the relationship betweensaid one measurement and each remaining one of the plurality of flowintensity measurements using an AutoRegressive model with eXogenousinputs (ARX).
 19. The computer readable medium of claim 14, wherein the computer program instructions defining step (d) comprise computer program instructions defining the step of: calculating a model of a relationship between each of the flow intensity measurements grouped into the cluster based on observed monitoring data.
 20. The computer readable medium of claim 16, wherein the computer program instructions defining the step of calculating a model of a relationship between each of the flow intensity measurements grouped into the cluster comprise computer program instructions defining the step of: modeling the relationship between each of the flow intensity measurements grouped into the cluster using an AutoRegressive model with eXogenous inputs (ARX).
 21. The computer readable medium of claim 16, wherein the computer program instructions defining step (d) further comprise computer program instructions defining the steps of: calculating a fitness score for each model; and comparing the fitness score for each model with a threshold to determine whether each model accurately represents a relationship between flow intensity measurements.
 22. The computer readable medium of claim 14, wherein the computer program instructions defining step (d) comprise computer program instructions defining the step of: deriving a relationship between each of the flow intensity measurements grouped into the cluster based on the relationships between said one measurement and each of the flow intensity measurements grouped into a cluster.
 23. The computer readable medium of claim 22, wherein the computer program instructions defining step (d) further comprise computer program instructions defining the steps of: calculating a fitness score for each derived relationship between each of the flow intensity measurements grouped into the cluster; and comparing the fitness score for each derived relationship with a threshold to determine whether each derived relationship accurately represents a relationship between flow intensity measurements.
 24. The computer readable medium of claim 14, further comprising computer program instructions defining the step of: sequentially validating relationships between flow intensity measurements determined in steps (b) and (d) with observed flow intensity measurements over a period of time to determine whether the relationships are invariant.
 25. The computer readable medium of claim 24, further comprising computer program instructions defining the step of: tracking changes in invariant relationships for fault detection and isolation.
 26. A system for modeling invariant relationships from a plurality of flow intensity measurements in a distributed system, comprising: means for randomly selecting one measurement from the plurality of flow intensity measurements; means for searching for a relationship between said one measurement and each remaining one of the plurality of flow intensity measurements; means for grouping said one measurement and each one of the plurality of flow intensity measurements having a relationship with said one measurement into a cluster; and means for determining relationships between flow intensity measurements other than said one measurement grouped into the cluster.
 27. The system of claim 26, wherein said means for searching for a relationship between said one measurement and each remaining one of the plurality of flow intensity measurements comprises: means for calculating a model of a relationship between said one measurement and each remaining one of the plurality of flow intensity measurements; means for calculating a fitness score for each model; and means for comparing the fitness score for each model to a threshold to determine which of the remaining ones of the plurality of flow intensity measurements have a relationship with said one measurement.
 28. The system of claim 27, wherein said means for calculating a model of a relationship between said one measurement and each remaining one of the plurality of flow intensity measurements comprises: means for modeling the relationship between said one measurement and each remaining one of the plurality of flow intensity measurements using an AutoRegressive model with eXogenous inputs (ARX).
 29. The system of claim 26, wherein said means for determining relationships between flow intensity measurements other than said one measurement grouped into the cluster comprises: means for calculating a model of a relationship between each of the flow intensity measurements grouped into the cluster based on observed monitoring data.
 30. The system of claim 26, wherein said means for determining relationships between flow intensity measurements other than said one measurement grouped into the cluster comprises: means for deriving a relationship between each of the flow intensity measurements grouped into a cluster based on the relationships between said one measurement and each of the flow intensity measurements grouped into a cluster.
 31. The system of claim 26, further comprising: means for sequentially validating relationships between flow intensity measurements with observed flow intensity measurements over a period of time to determine whether the relationships are invariant.
 32. The system of claim 31, further comprising: means for tracking changes in invariant relationships for fault detection and isolation.
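
By way of illustration only, steps (a) through (e) recited above might be sketched in Python as follows. The sketch is one possible reading under stated assumptions rather than a definitive embodiment: the function names `fit_arx` and `cluster_measurements`, the first-order ARX structure fitted by ordinary least squares, the normalized fitness score, the threshold of 0.5, and the synthetic measurement names and data are all hypothetical choices made for the example.

```python
import random
import numpy as np

def fit_arx(u, y, n=1, m=1):
    """Fit y(t) + a1*y(t-1) + ... = b0*u(t) + ... + bm*u(t-m) by least squares.

    Returns (theta, fitness). The fitness score used here, 1 - sqrt(SSE / SST)
    on one-step-ahead predictions, is an assumed normalization for this sketch.
    """
    u, y = np.asarray(u, dtype=float), np.asarray(y, dtype=float)
    k = max(n, m)
    rows = []
    for t in range(k, len(y)):
        past_outputs = [-y[t - i] for i in range(1, n + 1)]
        inputs = [u[t - j] for j in range(0, m + 1)]
        rows.append(past_outputs + inputs)
    phi = np.array(rows)
    target = y[k:]
    theta, *_ = np.linalg.lstsq(phi, target, rcond=None)
    sse = np.sum((target - phi @ theta) ** 2)
    sst = np.sum((target - target.mean()) ** 2)
    fitness = float(1.0 - np.sqrt(sse / sst)) if sst > 0 else 0.0
    return theta, fitness

def cluster_measurements(data, threshold=0.5, rng=None):
    """Group measurements into clusters; returns a list of (members, models)."""
    rng = rng or random.Random(0)
    remaining = sorted(data)
    clusters = []
    while remaining:                                   # step (e): repeat until all are grouped
        seed = rng.choice(remaining)                   # step (a): random selection
        related = [name for name in remaining          # step (b): model, score, compare to threshold
                   if name != seed
                   and fit_arx(data[seed], data[name])[1] >= threshold]
        members = [seed] + related                     # step (c): form the cluster
        models = {}
        for a in related:                              # step (d): pairs other than the seed
            for b in related:
                if a != b:
                    theta, score = fit_arx(data[a], data[b])
                    if score >= threshold:
                        models[(a, b)] = theta
        clusters.append((members, models))
        remaining = [name for name in remaining if name not in members]
    return clusters

# Synthetic example: two independent workloads, each driving several measurements.
gen = np.random.default_rng(1)
load_1 = gen.uniform(50, 150, 300)
load_2 = gen.uniform(50, 150, 300)
data = {
    "web": load_1,
    "app": 2.0 * load_1 + gen.normal(0, 1, 300),
    "db": 0.5 * load_1 + gen.normal(0, 1, 300),
    "mail": load_2,
    "spam": 3.0 * load_2 + gen.normal(0, 1, 300),
}
clusters = cluster_measurements(data, threshold=0.5, rng=random.Random(0))
for members, models in clusters:
    print(sorted(members), "pairwise models:", len(models))
```

In this sketch the first entry of each `members` list is the randomly selected measurement, and `models` holds only relationships between the other measurements of the same cluster, which corresponds to the calculation from observed data recited in claims 19 and 29. The sequential validation over a period of time recited in claims 24 and 31 is not shown.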